lectures.alex.balgavy.eu

Lecture notes from university.
git clone git://git.alex.balgavy.eu/lectures.alex.balgavy.eu.git
Log | Files | Refs | Submodules

Searching & Search Engines.html (2759B)


      1 <?xml version="1.0" encoding="UTF-8"?>
      2 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
      3 <html><head><link rel="stylesheet" type="text/css" href="sitewide.css"><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/><meta name="exporter-version" content="Evernote Mac 6.13.3 (455969)"/><meta name="created" content="2018-01-18 10:09:27 AM +0000"/><meta name="source" content="desktop.mac"/><meta name="updated" content="2018-01-18 4:45:21 PM +0000"/><title>Searching &amp; Search Engines</title></head><body><div>Classic info retrieval model:</div><div><br/></div><div><img src="Searching%20&amp;%20Search%20Engines.resources/screenshot.png" height="381" width="334"/></div><div><br/></div><div>Search types:</div><div><ul><li>Boolean search: terms (words) combined with AND/OR/NOT</li><li>Ad hoc query: one-time</li><li>others: browsing, recommender, social bookmarking</li></ul><div><br/></div></div><div>Search engines</div><div><ul><li>index pages: locally stored summaries</li><li>index building</li><ol><li>Crawling: follow links of pages they already know</li><li>Preprocessing</li><ul><li>Remove HTML tags</li><li>Tokenisation (split words on spaces)</li><li>Remove words like I, the it, a….</li><li>Take stem of words (e.g. “walking” -&gt; “walk”)</li></ul><li>Build an index</li><ul><li>simple — term document matrix</li><ul><li>1 if term appears in document, otherwise 0</li><li>to answer query, bitwise AND the term vectors</li></ul><li>better — inverted index (just a standard index)</li><ul><li>for each term, store a list of all the documents containing it</li><li>terms are looked up in index, docs followed using ID</li></ul></ul></ol><li>judging search engine fucntionality</li><ul><li>ask human raters to judge (because it’s subjective)</li><li>mathematical — precision, recall (need to know how to calculate this!)</li><ul><li>there’s a tradeoff between precision &amp; recall</li><li>as recall increases, precision decreases</li><li>F-measure is the harmonic mean of precision &amp; recall</li><li>precision at N only includes the first N results</li></ul></ul><li>ranking</li><ul><li>weighting scheme tf.idf: </li><ul><li>every word is given a weight for a document, some are more important than others</li><ul><li>tf: term frequency</li><li>idf: inverse document frequency</li></ul></ul><li>Zipf’s law: frequency of a word is inversely proportional to its rank in the frequency table</li><li>Heap’s law: most common words are encountered quickly while scanning, but you continue to encounter new words (so you need to keep indexing)</li><li>PageRank: absolute score for a page</li></ul></ul></div><div><br/></div></body></html>